Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text
Understanding the sentiment of a comment from a video or an image is an
essential task in many applications. Sentiment analysis of a text can be useful
for various decision-making processes. One such application is to analyse the
popular sentiments of videos on social media based on viewer comments. However,
comments from social media do not follow strict rules of grammar, and they
contain mixing of more than one language, often written in non-native scripts.
Non-availability of annotated code-mixed data for a low-resourced language like
Tamil also adds difficulty to this problem. To overcome this, we created a gold
standard Tamil-English code-switched, sentiment-annotated corpus containing
15,744 comment posts from YouTube. In this paper, we describe the process of
creating the corpus and assigning polarities. We present inter-annotator
agreement and show the results of sentiment analysis trained on this corpus as
a benchmark.
DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text
This paper describes the development of a multilingual, manually annotated
dataset for three under-resourced Dravidian languages generated from social
media comments. The dataset was annotated for sentiment analysis and offensive
language identification for a total of more than 60,000 YouTube comments. The
dataset consists of around 44,000 comments in Tamil-English, around 7,000
comments in Kannada-English, and around 20,000 comments in Malayalam-English.
The data was manually annotated by volunteer annotators and has a high
inter-annotator agreement in Krippendorff's alpha. The dataset contains all
types of code-mixing phenomena since it comprises user-generated content from a
multilingual country. We also present baseline experiments to establish
benchmarks on the dataset using machine learning methods. The dataset is
available on GitHub
(https://github.com/bharathichezhiyan/DravidianCodeMix-Dataset) and Zenodo
(https://zenodo.org/record/4750858#.YJtw0SYo_0M). Comment: 36 pages
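The abstract above reports inter-annotator agreement as Krippendorff's alpha but does not say how it was computed. As a rough illustration only, the sketch below computes alpha for nominal labels using the standard coincidence-matrix formulation. The function name, the input layout (one list of labels per comment, with missing ratings omitted), and the toy labels are assumptions made for this example; they are not taken from the paper.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists: the labels that different annotators gave
    to one item (hypothetical layout; missing ratings are simply omitted).
    """
    # Build the coincidence matrix: every ordered pair of labels within a
    # unit contributes 1 / (m - 1), where m is the number of labels there.
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a unit with a single rating carries no agreement info
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable values.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more ratings")

    # Observed and expected disagreement (the common factor 1/n cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label was ever used, so no disagreement is possible
    return 1.0 - observed / expected


# Toy example with made-up sentiment labels from two annotators.
example = [["positive", "positive"], ["negative", "negative"]]
print(krippendorff_alpha_nominal(example))
```

Perfect agreement gives an alpha of 1, while systematic disagreement drives the value below zero.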
Corpus creation for sentiment analysis in code-mixed Tamil-English text
Understanding the sentiment of a comment from a video or an image is an essential task in many applications. Sentiment analysis
of a text can be useful for various decision-making processes. One such application is to analyse the popular sentiments of videos
on social media based on viewer comments. However, comments from social media do not follow strict rules of grammar, and they
contain mixing of more than one language, often written in non-native scripts. Non-availability of annotated code-mixed data for a
low-resourced language like Tamil also adds difficulty to this problem. To overcome this, we created a gold standard Tamil-English
code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of
creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on
this corpus as a benchmark.
This publication has emanated from research supported in part by a research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289 (Insight), SFI/12/RC/2289 P2 (Insight 2), co-funded by the European Regional Development Fund as well as by the EU H2020 programme under grant agreements 731015 (ELEXIS-European Lexical Infrastructure), 825182 (Prêt-à-LLOD), and Irish Research Council grant IRCLA/2017/129 (CARDAMOM-Comparative Deep Models of Language for Minority and Historical Languages).
Non-peer-reviewed
Offensive language identification in Dravidian languages using MPNet and CNN
Social media has effectively replaced traditional forms of communication and marketing. As these platforms allow for the free expression of ideas and facts through text, images, and videos, there exists a significant need to screen them to safeguard people and organisations from objectionable information directed at them. Our work aims to categorise code-mixed social media comments and posts in Tamil, Malayalam, and Kannada into offensive or not offensive at different levels. We present a multilingual MPNet and CNN fusion model for detecting offensive language content directed at an individual (or group) in low-resource Dravidian languages at different levels. Our model is capable of handling data that has been code-mixed, such as Tamil and Latin scripts. The model was successfully validated on the datasets, achieving weighted average F1-scores of 0.85, 0.98, and 0.76 and outperforming the baseline models EWDT and EWODT by 0.02, 0.02, and 0.04 for Tamil, Malayalam, and Kannada, respectively.
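The scores quoted above are weighted average F1-scores. As a hedged illustration of what that metric measures, the sketch below computes the per-class F1 and averages it with weights proportional to each class's share of the true labels. The function name and the toy labels are assumptions for this example and do not come from the paper, whose evaluation code is not described in the abstract.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    labels = set(y_true) | set(y_pred)
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Classes that never occur in y_true have zero support and zero weight.
        score += support[label] / total * f1
    return score


# Toy example with made-up labels (not data from the paper).
gold = ["off", "off", "not", "not", "not"]
pred = ["off", "not", "not", "not", "off"]
print(round(weighted_f1(gold, pred), 4))
```

Because each class is weighted by its support, a high weighted F1 on an imbalanced dataset can still hide weak performance on rare classes.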
Multilingual multimodal machine translation for Dravidian languages utilizing phonetic transcription
Multimodal machine translation is the task of translating from a source text into the target language using information from other modalities. Existing multimodal datasets have been restricted to only highly resourced languages. In addition, these datasets were collected by manual translation of English descriptions from the Flickr30K dataset. In this work, we introduce MMDravi, a Multilingual Multimodal dataset for under-resourced Dravidian languages. It comprises 30,000 sentences, which were created utilizing several machine translation outputs. Using data from MMDravi and a phonetic transcription of the corpus, we build a Multilingual Multimodal Neural Machine Translation system (MMNMT) for closely related Dravidian languages to take advantage of the multilingual corpus and other modalities. We evaluate the translations generated by the proposed approach against a human-annotated evaluation dataset in terms of BLEU, METEOR, and TER metrics. Relying on multilingual corpora, phonetic transcription, and image features, our approach improves the translation quality for the under-resourced languages.
This work is supported by a research grant from Science Foundation Ireland, co-funded by the European Regional Development Fund, for the Insight Centre under Grant Number SFI/12/RC/2289 and the European Union’s Horizon 2020 research and innovation programme under grant agreement No 731015, ELEXIS - European Lexical Infrastructure and grant agreement No 825182, Prêt-à-LLOD.
Non-peer-reviewed
Findings of the VarDial Evaluation Campaign 2021
This paper describes the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2021. The campaign was part of the eighth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2021. Four separate shared tasks were included this year: Dravidian Language Identification (DLI), Romanian Dialect Identification (RDI), Social Media Variety Geolocation (SMG), and Uralic Language Identification (ULI). DLI was organized for the first time and the other three continued a series of tasks from previous evaluation campaigns.
Non-peer-reviewed